Class 31

DATA1220-55, Fall 2024

Sarah E. Grabinski

2024-11-20

Review: 2 Numeric Variables

Review: Describing Associations

  • Independence: an increase in \(X\) is not associated with a change in \(Y\)
  • Positive association: an increase in \(X\) is associated with an increase in \(Y\)
  • Negative association: an increase in \(X\) is associated with a decrease in \(Y\)
  • Weak association: data points are widely scattered around the overall trend
  • Strong association: data points are tightly clustered around the overall trend

Practice

Which image shows a positive relationship between the explanatory and response variables?

Income vs Education

Age vs Survival

Practice

Which image shows a strong relationship between the explanatory and response variables?

Correlation

  • Describes the direction and strength of the association between 2 numeric variables
  • A correlation ranges from -1 to 1

    • A perfect negative correlation equals -1

    • A perfect positive correlation equals 1

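  • A correlation of 0 indicates no linear association between the two variables; independent variables have a correlation of 0, but a correlation of 0 alone does not guarantee independence
  • Different techniques for linear (Pearson) vs non-linear but monotonic (Spearman) relationships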
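
As a minimal sketch of the two techniques, R's built-in `cor()` function accepts a `method` argument. The vectors `x` and `y` below are made-up illustration values, not course data.

```r
# Hypothetical data: y always increases with x, but along a curve
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3

# Pearson measures how closely the points follow a straight line
pearson_r <- cor(x, y, method = "pearson")

# Spearman works on the ranks of the values, so it measures whether
# y consistently increases (or decreases) as x increases
spearman_r <- cor(x, y, method = "spearman")

pearson_r   # less than 1 because the pattern is curved
spearman_r  # equals 1 because y always increases with x
```

When a scatterplot looks curved but consistently increasing or decreasing, Spearman is usually the more informative summary.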

Linear vs Non-Linear

Interpreting Correlations

  • 1: perfect positive correlation
  • 0.9: high positive correlation
  • 0.5: low positive correlation
  • 0: no correlation
  • -0.5: low negative correlation
  • -0.9: high negative correlation
  • -1: perfect negative correlation

Example: Poverty vs Graduation Rate

What’s the response variable?

Response variable: Percent of people in poverty

Example: Poverty vs Graduation Rate

What’s the explanatory variable?

Explanatory variable: Percent of people who graduated high school

Example: Poverty vs Graduation Rate

Describe the relationship between these 2 variables.

Relationship: linear, negative, moderate to strong

Example: Poverty vs Graduation Rate

Which of the following is the most likely correlation? A. 0.60 B. -0.25 C. -0.75 D. 0.35

Example: Poverty vs Graduation Rate

Most likely correlation: C. -0.75 (the relationship is negative and moderate to strong)

Testing a Correlation

  • Null hypothesis: the two variables are uncorrelated (correlation = 0)

\[ H_0 \colon \rho=0 \]

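  • Alternative hypothesis: the two variables are correlated; choose one of the following based on the research question

\[ \begin{aligned} H_A &\colon \rho > 0 \\ \text{or } H_A &\colon \rho < 0 \\ \text{or } H_A &\colon \rho \ne 0 \end{aligned} \]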

Test Statistic

The test statistic \(t\) for testing the population Pearson correlation \(\rho\) (Greek letter rho) is calculated from the observed sample correlation \(r\) and the sample size \(n\).

\[ t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
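
The formula above can be sketched in R as follows. The values of `r` and `n` are hypothetical and chosen only to illustrate the arithmetic.

```r
# Hypothetical values for illustration only
r <- -0.75  # observed sample correlation
n <- 51     # sample size

# t = r * sqrt(n - 2) / sqrt(1 - r^2)
test_statistic <- r * sqrt(n - 2) / sqrt(1 - r^2)
test_statistic  # about -7.94
```

A correlation far from 0 or a large sample size both push \(t\) further from 0.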

Getting a p-value

Use the Student’s \(t\) distribution with degrees of freedom \(\text{df}=n-2\) to find a p-value for the observed correlation \(r\) in a sample of size \(n\) under the null hypothesis \(H_0 \colon \rho=0\).

# upper-tail p-value (H_A: rho > 0); test_statistic and n must already be defined
pt(test_statistic,
   df = n - 2,          # degrees of freedom
   lower.tail = FALSE)  # FALSE gives P(T > t); the default TRUE gives P(T <= t)
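
The `pt()` call above can be adapted to each alternative hypothesis, as in the sketch below. The data are made up for illustration, and the comparison against `cor.test()` is included only as a sanity check, assuming its default Pearson method.

```r
# Hypothetical data for illustration only
x <- c(2, 4, 5, 7, 8, 10, 11, 13)
y <- c(9, 8, 8, 6, 7, 4, 5, 2)

n <- length(x)
r <- cor(x, y)
test_statistic <- r * sqrt(n - 2) / sqrt(1 - r^2)

# One-sided alternatives use a single tail of the t distribution
p_greater <- pt(test_statistic, df = n - 2, lower.tail = FALSE)  # H_A: rho > 0
p_less    <- pt(test_statistic, df = n - 2, lower.tail = TRUE)   # H_A: rho < 0

# The two-sided alternative doubles the tail area beyond |t|
p_two_sided <- 2 * pt(abs(test_statistic), df = n - 2, lower.tail = FALSE)

# cor.test() runs the same Pearson test; its p-value should match
cor_test_p <- cor.test(x, y, alternative = "two.sided")$p.value
p_two_sided
cor_test_p
```

Choose the tail before looking at the data, so that it matches the alternative hypothesis stated for the research question.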

Eyeballing a Line

How do we find the best line to draw through the data points when two variables appear to have a linear relationship?

Quantifying Error: Residuals

A residual is the difference between an observed value and the value predicted by the line (observed minus predicted).
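
As a small sketch of this definition, the R code below computes residuals for an eyeballed line. The data, the `intercept`, and the `slope` are all hypothetical values chosen for illustration.

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# An eyeballed line: hypothetical intercept and slope
intercept <- 0
slope <- 2

# Predicted value of y at each observed x
predicted <- intercept + slope * x

# Residual = observed - predicted
resid_values <- y - predicted
resid_values
```

Positive residuals mark points above the line and negative residuals mark points below it.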

Example: Correlation vs Causation

Do ice cream sales cause drowning incidents?

Example: Correlation vs Causation

  • As ice cream sales increase, the number of drownings increases

  • Strong, positive correlation

  • High temperatures increase both ice cream consumption and the number of people swimming
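
The simulation below is one way to see this pattern in R. All numbers are simulated under assumed coefficients, not real sales or drowning data: `temperature` drives both outcomes, and neither outcome appears in the other's formula.

```r
set.seed(1220)  # arbitrary seed so the simulation is reproducible

# Simulated (not real) daily data: temperature drives both outcomes
temperature <- runif(200, min = 50, max = 100)
ice_cream   <- 10 + 2.0 * temperature + rnorm(200, sd = 10)
drownings   <- 1 + 0.1 * temperature + rnorm(200, sd = 1)

# Ice cream sales and drownings are positively correlated even though
# neither one was used to generate the other
sales_drowning_r <- cor(ice_cream, drownings)
sales_drowning_r
```

The positive correlation here comes entirely from the shared dependence on temperature, which is why a correlation alone cannot establish causation.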